1 Executive Summary

This report investigates the frequency of motor vehicle collisions in New York City (NYC) according to the time of day. Our results show that the number of accidents occurring per day has remained fairly constant. However, an increase in motor vehicle collisions was observed between 8am and 9am, and the highest number of road fatalities occurred between 4pm and 6pm. These findings reflect the NYC ‘rush hours’, when people are commuting to or from work or going out in the evening. In particular, Friday 4pm-6pm had the highest number of fatalities during the week, whereas Saturday and Sunday had significantly fewer accidents than weekdays. The numbers of motor vehicle collisions occurring on Monday, Tuesday, Wednesday, and Thursday were similar. These findings inform the NYC Fire Department (FDNY) and the FDNY Bureau of Emergency Medical Services (EMS) of the times at which they must be most alert.

2 Exploring the Dataset

2.1 Background to report

Motor vehicle collisions are a major cause of death and injury in New York (City of New York, 2021). This report aims to inform stakeholders of the most common times of day for motor vehicle collisions in NYC between 2012 and 2021. Relevant stakeholders include the NYC Fire Department (FDNY) and the FDNY Bureau of Emergency Medical Services (EMS).

2.2 Assessment of data provenance

The Motor Vehicle Collisions crash data was sourced from NYC OpenData and was provided by the NYPD for public safety purposes (NYC OpenData, 2021). The dataset is classified as free public data, and the NYC OpenData website includes thorough information on attribution, creation date, and the data generation process (NYC OpenData, 2021). Data was collected by police officers, who completed an MV-104AN report for all vehicle collisions in NYC (NYC OpenData, 2021). Only very basic data was collected from 1999 to 2016, with more detailed information collected from 2016 onwards (NYC OpenData, 2021). An additional limitation is that the published dataset contains no records prior to 2012, which prevents us from analysing trends over a longer period. The data is considered reliable because it was entered by trained police officers (NYC OpenData, 2021). However, potential human error must still be considered, and data collection practices may vary between individual police officers. The dataset is updated daily, and a ‘MVCDataDictionary’ spreadsheet records its revision history (NYC OpenData, 2021).

2.3 Domain knowledge

Previous statistics indicate that car collisions in NYC are more common on weekdays at lunchtime and during the evening peak hour, when individuals are commuting (Sullivan & Galleshaw LLP, 2021). Between 9pm and 3am, collisions occur more frequently on weekends (Sullivan & Galleshaw LLP, 2021). For ethics and privacy purposes, the Motor Vehicle Collisions dataset does not reveal confidential information about individuals.

2.4 Data structure

The dataset consists of 1.7 million rows and 29 columns. Each row corresponds to a motor vehicle collision, and the columns provide details of that collision.
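Because each row is one collision, a frequency by time of day is simply a count of rows per hour. The following is a minimal sketch of that idea in base R, not the analysis code used for this report. The toy rows, the column names CRASH.DATE and CRASH.TIME, and the date format are assumptions made for illustration.

```r
# Toy rows standing in for the dataset. The column names CRASH.DATE and
# CRASH.TIME and all values are assumptions made for illustration.
toy <- data.frame(
  CRASH.DATE = c("09/09/2013", "09/09/2013", "09/13/2013", "09/14/2013"),
  CRASH.TIME = c("08:15", "17:30", "16:45", "02:10"),
  stringsAsFactors = FALSE
)

# Combine date and time into one timestamp, then pull out hour and weekday
stamp <- as.POSIXlt(paste(toy$CRASH.DATE, toy$CRASH.TIME),
                    format = "%m/%d/%Y %H:%M", tz = "UTC")
toy$hour <- stamp$hour
# wday runs from 0 (Sunday) to 6 (Saturday)
toy$weekday <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")[stamp$wday + 1]

# One row per collision, so a frequency table is a count of rows
by_hour <- table(toy$hour)
by_weekday_hour <- table(toy$weekday, toy$hour)
```

The same grouping could be restricted to rows with fatalities to compare time periods by severity rather than by collision count.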

3 Cleaning

library(tidyverse) # piping `%>%`, plotting, reading data
library(skimr) # exploratory data summary
library(naniar) # exploratory plots
library(kableExtra) # tables
library(lubridate) # for date variables
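library(plotly) # interactive plots
library(dplyr) # data manipulation (also attached by tidyverse)
nyc = read.csv("MVC.csv") # load the Motor Vehicle Collisions data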
#nyc %>% glimpse()
#nyc %>% summary()
#vis_miss(nyc, warn_large_data = FALSE)

# Count missing values in every column that contains at least one NA
NAtable <- nyc %>%
  select(where(~ any(is.na(.x)))) %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
kable(NAtable)

Number of missing values (NA) in each column that contains any:
ZIP.CODE  LATITUDE  LONGITUDE  NUMBER.OF.PERSONS.INJURED  NUMBER.OF.PERSONS.KILLED  VEHICLE.TYPE.CODE.2
490687    197179    197179     17                         31                        2

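# Drop rows with blank or zero coordinates (0, 0 is not a valid NYC location)
cleannyc <- nyc[!(nyc$LONGITUDE == "" | nyc$LATITUDE == "" | nyc$LOCATION == "" | nyc$LATITUDE == 0 | nyc$LONGITUDE == 0),]
#cleannyc %>% glimpse()
#table(is.na(cleannyc))
# Remove any remaining rows that contain NA
cleannyc <- na.omit(cleannyc)
#vis_miss(cleannyc, warn_large_data = FALSE)
# Widen the left margin for the long column names, then draw the boxplots
par(mar = c(5, 10, 4, 2) + 0.1)
boxplot(cleannyc[, 11:18], cex.axis = 0.6, las = 1, horizontal = TRUE)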

As the boxplot above shows, there are many outliers, especially in the number of motorists injured and the number of persons injured. On inspection, the largest value for the number of motorists injured was recorded on 9/9/2013 and corresponds to a Brooklyn bus accident in which a car collided head-on with a bus, leaving 43 people injured (DNA Info, 2013). The same entry is also the largest outlier in the number of persons injured. It appears in both the persons and motorists categories because the 43 people were on the bus and were therefore classified as motorists. Without inspection, these outliers may seem extraordinary and could be mistaken for faulty data collection, but on closer examination they appear valid and are an important part of our analysis. In fact, relative to the mean and median of these columns, most of the circles in the graph are flagged as outliers. The median and mean of each column are close to 0, which is expected: the dataset has a very large number of entries, and the probability of injury or death in any single collision is relatively small, so the distributions are heavily right-skewed. We therefore cannot say that every data point greater than 0 is a genuine outlier.

#max(cleannyc$NUMBER.OF.MOTORIST.INJURED)
#cleannyc %>% filter(NUMBER.OF.MOTORIST.INJURED == 43)
#max(cleannyc$NUMBER.OF.PERSONS.INJURED)
#cleannyc %>% filter(NUMBER.OF.PERSONS.INJURED == 43)
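
The skewness point above can be illustrated with a small sketch. The counts below are synthetic and invented for illustration, not taken from the dataset. When most values are 0, both quartiles are 0, so the boxplot rule flags every non-zero count as an outlier.

```r
# Synthetic injury counts, invented for illustration (not from the dataset):
# 95 collisions with no injuries, 4 with one injury, 1 extreme event.
injured <- c(rep(0, 95), rep(1, 4), 43)

# Both hinges (lower and upper quartile) are 0, so the box has zero width
stats <- boxplot.stats(injured)
hinges <- stats$stats[c(2, 4)]

# With an interquartile range of 0, the whiskers also sit at 0 and every
# non-zero count is drawn as an individual point
flagged <- stats$out
```

In this toy example the four single-injury collisions are flagged alongside the 43-injury event, even though only the latter is unusual.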

Even though these outliers are valid, they affect our aggregate statistics by pulling the mean higher than is representative of a typical collision. For this reason the median is a better summary than the mean, as it is far less sensitive to high outliers. Low outliers are not a concern because the counts cannot fall below 0. The graph below shows these two outliers and their effect in more detail.

fig = plot_ly(y = cleannyc$NUMBER.OF.PERSONS.INJURED, type = "box", name = "Number of Persons Injured")
fig = fig %>% add_trace(y = cleannyc$NUMBER.OF.MOTORIST.INJURED, name = "Number of Motorists Injured") %>% layout(title = "Persons Injured and Motorist Injured Outlier Analysis")
fig
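
The effect of a single extreme value on the mean, as discussed above, can be sketched with synthetic numbers. The counts are invented for illustration and are not taken from the dataset.

```r
# Synthetic injury counts, invented for illustration (not from the dataset)
typical <- c(rep(0, 95), rep(1, 4))   # 99 ordinary collisions
with_extreme <- c(typical, 43)        # add one 43-injury event

mean_before <- mean(typical)          # 4 / 99
mean_after <- mean(with_extreme)      # 47 / 100
median_before <- median(typical)
median_after <- median(with_extreme)
```

In this toy example the one extreme event raises the mean more than tenfold, while the median is unchanged.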

5 Reflection on Data Wrangling

Without data wrangling, we would be unable to compute reliable aggregates or produce data visualisations, because of the missing values and possibly false data entries. Analysing the raw data would risk false conclusions and compromise the integrity of the analysis.

6 References

City of New York. (2021). Vision Zero in New York City. Retrieved from https://www1.nyc.gov/content/visionzero/pages/.

NYC OpenData. (2021). Motor Vehicle Collisions - Crashes. Retrieved from https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95.

Sullivan & Galleshaw LLP. (2021). How Common are Car Accidents in NYC? Retrieved from https://www.sullivangalleshaw.com/common-car-accidents-nyc/.

DNA Info. (2013). 43 People Injured in Bed-Stuy When Car Collides Head-on with City Bus. Retrieved from https://www.dnainfo.com/new-york/20130909/bed-stuy/43-people-injured-bed-stuy-when-car-collides-head-on-with-city-bus/.

Fleet Report - Mayor’s Office of Operations. (2021). Retrieved 2 July 2021, from https://www1.nyc.gov/site/operations/performance/fleet-report.page

End-to-End Response Time - 911 Reporting. (2021). Retrieved 2 July 2021, from https://www1.nyc.gov/site/911reporting/reports/end-to-end-repsonse-time.page

NHTSA. (2021). Retrieved 3 July 2021, from https://www-fars.nhtsa.dot.gov/Main/index.aspx

EMS, One More Time. (2015). Retrieved 4 July 2021, from https://www.city-journal.org/html/ems-one-more-time-12793.html?wallit_nosession=1